CO₂ Emission Prediction — End-to-End MLOps Pipeline

TL;DR

Polynomial Regression model (degree 2, R² = 0.94) predicting hourly CO₂ emissions in steel manufacturing. Prototyped in Streamlit, then productionised via Flask REST API + Docker — deployed live on Render. PCA applied for dimensionality reduction and emission pattern visualisation.

Tags: Machine Learning, Polynomial Regression, Dimensionality Reduction, Docker, Flask, MLOps, Sustainability

Project Overview

This project predicts hourly CO₂ emissions in steel manufacturing, helping businesses monitor and optimise production processes to reduce environmental impact. Accurate emission forecasting enables proactive compliance with regulatory standards and measurable improvements in sustainability across industrial operations.

Modelling started from a linear regression baseline, which could not capture the non-linear emission patterns in the data (R² = 0.71). Degree-2 Polynomial Regression raised this to R² = 0.94. PCA was applied beforehand to reduce dimensionality and to visualise the primary emission drivers ahead of feature selection.

The project followed a full MLOps lifecycle: prototyped with Streamlit for rapid validation, then productionised via Flask REST API and Docker for consistent, environment-independent deployment — and finally hosted live on Render for public API access.

Key Insights

Polynomial Regression (degree 2) achieved R² = 0.94, compared to R² = 0.71 for linear regression: a 32% relative improvement in explained variance ((0.94 − 0.71) / 0.71 ≈ 0.32).
PCA revealed that the first two principal components explain over 78% of emission variance, confirming that a small number of production parameters drive most of the CO₂ output.
Accurate hourly predictions allow production teams to shift energy-intensive processes to off-peak windows, directly reducing avoidable emissions.
Deploying as a REST API (not just a notebook or Streamlit app) means existing industrial monitoring systems can call the model over HTTP without any changes to the model code.

Technical Implementation

Data Preprocessing:
- Aggregated raw sensor data to hourly intervals for consistent forecasting granularity.
- Applied PCA to reduce dimensions and identify the principal emission drivers before feature selection.
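
The two preprocessing steps above can be sketched as follows. This is a minimal illustration on synthetic data, not the project's actual code: the column names, the 10-minute sampling rate, and the choice of mean aggregation are all assumptions made so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for raw sensor readings at 10-minute intervals.
# Column names are hypothetical, not the project's real schema.
rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=6 * 24 * 7, freq="10min")
load = rng.normal(50.0, 10.0, len(idx))
raw = pd.DataFrame(
    {
        "energy_kwh": load + rng.normal(0.0, 1.0, len(idx)),
        "furnace_temp_c": 1500.0 + 4.0 * load + rng.normal(0.0, 5.0, len(idx)),
        "reactive_power_kvarh": 0.3 * load + rng.normal(0.0, 1.0, len(idx)),
        "oxygen_flow": rng.normal(20.0, 2.0, len(idx)),
    },
    index=idx,
)

# Aggregate to hourly granularity (mean of each sensor within the hour).
hourly = raw.resample(pd.Timedelta(hours=1)).mean()

# Standardise first: PCA is scale-sensitive and the columns use different units.
scaled = StandardScaler().fit_transform(hourly)

pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

print("hourly rows:", len(hourly))
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))

# Loadings show which sensors contribute most to each principal component.
loadings = pd.DataFrame(pca.components_.T, index=hourly.columns, columns=["PC1", "PC2"])
print(loadings.round(2))
```

Reading the loadings table is what turns PCA into a diagnostic: sensors with large weights on the same component tend to move together.
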
Modelling Approach:
- Benchmarked Linear Regression (R² = 0.71) — confirmed non-linearity in residual plots.
- Implemented Polynomial Regression (degree 2), achieving R² = 0.94. Degree 3 was tested but showed signs of overfitting on the validation set.
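
A degree comparison of this kind might look like the sketch below. The data is synthetic with a built-in quadratic relationship, so the scores it prints are not the project's R² values; the pipeline layout (scaling, then polynomial expansion, then ordinary least squares) is one reasonable assumption about how such a model could be assembled with scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a quadratic relationship; NOT the project's dataset.
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 10.0, size=(500, 2))
y = 0.5 * X[:, 0] ** 2 + 1.5 * X[:, 0] * X[:, 1] + rng.normal(0.0, 2.0, 500)

# Ordered split: first 80% for training, last 20% for validation (no shuffling).
split = int(0.8 * len(X))
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]


def fit_and_score(degree):
    """Fit a polynomial regression of the given degree and return validation R^2."""
    model = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    return r2_score(y_val, model.predict(X_val))


scores = {degree: fit_and_score(degree) for degree in (1, 2, 3)}
for degree, r2 in scores.items():
    print(f"degree {degree}: validation R^2 = {r2:.3f}")
```

Degree 1 in this pipeline is plain linear regression, so the same function covers both the baseline and the polynomial candidates.
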
MLOps & Deployment Pipeline:
- Step 1 — Prototype: Rapid validation of model behaviour via Streamlit UI.
- Step 2 — Productionise: Rebuilt as a Flask REST API with a /predict endpoint accepting JSON input.
- Step 3 — Containerise: Dockerised the application for environment consistency — same behaviour locally and in production.
- Step 4 — Deploy: Hosted on Render, publicly accessible via API.
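
As an illustration of Step 2, a /predict endpoint could be shaped roughly like this. The feature names, the response key `predicted_co2`, and the placeholder model trained on synthetic data are all hypothetical; the write-up does not specify the real input schema or how the model artefact is loaded.

```python
import numpy as np
from flask import Flask, jsonify, request
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical feature names; the real API's input schema is not given here.
FEATURES = ["energy_kwh", "reactive_power_kvarh", "load_factor"]


def train_placeholder_model():
    """Fit a degree-2 model on synthetic data so the sketch runs end to end.

    A real service would load a persisted model artefact instead.
    """
    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 1.0, size=(200, len(FEATURES)))
    y = X[:, 0] ** 2 + 0.5 * X[:, 1] + 0.1 * X[:, 2]
    return make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)


app = Flask(__name__)
model = train_placeholder_model()


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(silent=True)
    if not isinstance(payload, dict):
        return jsonify(error="expected a JSON object"), 400
    missing = [name for name in FEATURES if name not in payload]
    if missing:
        return jsonify(error=f"missing features: {missing}"), 400
    try:
        row = [[float(payload[name]) for name in FEATURES]]
    except (TypeError, ValueError):
        return jsonify(error="feature values must be numeric"), 400
    return jsonify(predicted_co2=float(model.predict(row)[0]))


# Exercise the endpoint in-process with Flask's test client (no server needed).
client = app.test_client()
response = client.post(
    "/predict",
    json={"energy_kwh": 0.5, "reactive_power_kvarh": 0.2, "load_factor": 0.7},
)
print(response.status_code, response.get_json())
```

Validating the payload and returning a 400 with a message keeps malformed requests from surfacing as opaque server errors, which matters once other systems call the API. Inside a container the same `app` object would typically be served by a WSGI server rather than the test client used here.
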

Live Preview

The live preview is hosted on Render's free tier, so the first request may take 30–60 s to cold-start.

Key Learnings

The prototype-to-production gap is real: Streamlit is excellent for validating model behaviour quickly, but the architectural decision to rebuild as a Flask API (rather than deploying Streamlit directly) was correct — the API is now integrable into any system, not just a web browser.
Choosing the polynomial degree is a bias-variance tradeoff: Degree 2 generalised well; degree 3 overfit. Validating on a held-out time window rather than a random split was essential because emission data has temporal structure.
PCA is a diagnostic tool, not just preprocessing: The visualisation of principal components revealed which production parameters cluster together, which informed feature grouping decisions before model training.
Docker makes "works on my machine" obsolete: The containerised API ran identically in local dev, CI, and Render — zero environment debugging after containerisation.
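
The point about time-aware validation can be made concrete with scikit-learn's `TimeSeriesSplit`, where every fold trains on the past and validates on the block that follows. The sketch below uses a synthetic hourly series and hypothetical features (`load`, `hour_of_day`); it shows the mechanics only and does not reproduce the project's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic hourly series, already ordered by time; NOT the project's data.
rng = np.random.default_rng(7)
n_hours = 24 * 60
hour_of_day = np.arange(n_hours) % 24
load = 50.0 + 20.0 * np.sin(2 * np.pi * hour_of_day / 24) + rng.normal(0.0, 3.0, n_hours)
X = np.column_stack([load, hour_of_day])
y = 0.02 * load ** 2 + rng.normal(0.0, 2.0, n_hours)

# Each fold trains on earlier rows and validates on the rows that follow.
cv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in cv.split(X):
    assert train_idx.max() < val_idx.min()  # no future data leaks into training

mean_r2 = {}
for degree in (1, 2, 3):
    model = make_pipeline(StandardScaler(), PolynomialFeatures(degree), LinearRegression())
    mean_r2[degree] = cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    print(f"degree {degree}: mean time-ordered R^2 = {mean_r2[degree]:.3f}")
```

A shuffled split would let the model see hours adjacent to the ones it is scored on, which tends to flatter higher-degree models on autocorrelated data.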

Future Work

Add walk-forward (time-series) cross-validation across multiple folds in place of a single train/test split; emission data has temporal autocorrelation that random splits ignore.
Evaluate gradient boosting models (XGBoost, LightGBM) against the polynomial baseline — they often handle non-linear tabular data better at scale.
Add a monitoring endpoint to the Flask API that tracks prediction drift over time as new production data comes in.
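
One possible starting point for the drift-tracking idea is a small rolling-window monitor that a Flask route could expose as JSON. Everything in this sketch is an assumption: the class name, the window size, and the drift rule (shift of the recent mean measured in reference standard deviations) are illustrative choices, and a production setup might prefer a proper distribution test.

```python
from collections import deque
from statistics import fmean


class PredictionDriftMonitor:
    """Tracks how far recent predictions have moved from a reference distribution."""

    def __init__(self, reference_mean, reference_std, window=168, threshold=3.0):
        if reference_std <= 0:
            raise ValueError("reference_std must be positive")
        self.reference_mean = reference_mean
        self.reference_std = reference_std
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # keeps only the latest `window` predictions

    def record(self, prediction):
        """Store one prediction; the oldest is dropped once the window is full."""
        self.recent.append(float(prediction))

    def summary(self):
        """Return a JSON-serialisable snapshot of the current drift status."""
        if not self.recent:
            return {"count": 0, "recent_mean": None, "shift_in_std": None, "drifting": False}
        recent_mean = fmean(self.recent)
        shift = (recent_mean - self.reference_mean) / self.reference_std
        return {
            "count": len(self.recent),
            "recent_mean": recent_mean,
            "shift_in_std": shift,
            "drifting": abs(shift) > self.threshold,
        }


# Reference statistics would come from training-time predictions (values here are made up).
monitor = PredictionDriftMonitor(reference_mean=40.0, reference_std=5.0, window=24, threshold=1.0)
for value in [41.0, 39.5, 40.5, 40.0]:
    monitor.record(value)
stable = monitor.summary()

for value in [55.0] * 24:
    monitor.record(value)
shifted = monitor.summary()
print(stable["drifting"], shifted["drifting"])
```

In the API, the /predict handler would call `monitor.record(...)` on each prediction, and a separate read-only route would return `monitor.summary()`.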

Built by Om Patel — ML Engineer & Data Scientist.
Explore more projects on my Portfolio.
